A graph-based data quality analysis in distributed telemedicine systems

Telemedicine is one of the most rapidly developing areas of healthcare, and it plays an increasing role in modern medicine. As the amount of data and the demand for features increase, data paths are becoming ever more complex. It is therefore vital in telemedicine to find a proper balance between consistency and availability under any given circumstances. However, making this trade-off can significantly influence the quality of the data. This study seeks an in-depth view of the problem by considering a real-world telemedicine use-case and elaborating the formal system specification of the scenario. After the specification is evaluated, the constructed state graph is examined using graph coloring and other graph algorithms.

Telemedicine is one of the fastest developing areas of medicine. Today, more and more data items in a healthcare system are handled electronically, stored in the cloud and readily shared with other systems. Telemedicine is a comprehensive area of healthcare that concerns many specialties. Because of this, telemedicine systems have to be designed so that they can be easily integrated with other systems. Various techniques are available for secure access to external systems, such as standards, Application Programming Interfaces (API) and portable server instances. A system can also contain self-developed parts, caches and Content Delivery Networks (CDN) that raise the level of availability. Server-side computational units are also frequent because they unburden the client side by performing resource-intensive tasks. Sometimes the closeness of data is essential, so computation tasks are outsourced to edge devices [1]. Although telemedicine systems are usually viewed as simple client-server systems, the reasons and solutions mentioned above can lead to very complex data paths. Figure 1 shows a schematic form of a real-world telemedicine system and illustrates how complicated the data paths can be in distributed telemedicine systems. In the end, however, the combined data portions and aggregations may appear as a single result table that is visible to a patient or a practitioner.
Since telemedicine applications are mostly web-based, a significant number of requests have to be served simultaneously on the server side. In addition, huge amounts of data and many computational tasks are present. In order to serve so many requests, a distributed system is necessary. Besides their advantages, distributed systems have some disadvantages as well. Eric Brewer's theorem about Consistency, Availability and Partition tolerance (CAP) [2] states that no distributed system can guarantee all three of these desirable properties at once; at most two of them can be guaranteed simultaneously. The extension of the CAP theorem states that in the case of network Partitioning (P) a trade-off has to be made between Availability (A) and Consistency (C), but Else (E), when the system is running normally in the absence of partitions, another trade-off has to be made between Latency (L) and Consistency (C). This extension is the so-called PACELC theorem [3]. Both theorems assert that availability and consistency cannot be guaranteed simultaneously at 100%.
Additionally, the data goes through many stages, so the round-trip and the production of the finally visible data take some time. These latencies are external factors that are always present in these systems and play important roles. Latency strongly influences not only the availability, but also the consistency and the data quality [4].
In this paper, the following results for distributed telemedicine systems are presented:
(1) The formal modeling of a concrete telemedicine system that operates today and the evaluation of the system specification via model checking: the system specification consists of a client, a distributed database system, computational units and a cache. The correctness of the system model is verified with a model checker and the state graph is dumped into graph files;
(2) Visualizing the new metrics for the reliability of data: the model checking results in a state graph file that contains the simulations under different circumstances grouped into graph components. The nodes of the components contain information concerning the Quality of Data (QoD) limited by caching strategies and latency values;
(3) Analyzing the state graph of the system model with graph theoretical algorithms: the structure of the state graph components contains graph theoretically relevant information, and it can be used for clustering.

SYSTEM MODELING APPROACH
In telemedicine systems, availability and consistency are both important, and it is hard to make a trade-off between them, because different telemedicine use-cases need specific configurations.
Measuring the consistency level in a distributed system is not trivial. Finding the proper metrics and measurement techniques is essential. Probabilistically Bounded Staleness (PBS) is a promising method that was presented by Peter Bailis et al. [5]. It shows how much time has to elapse for eventual consistency in quorum-based distributed database systems. In their study, the t-visibility and k-staleness metrics describe the trade-off between availability and consistency. Operation latency is described with four latency values: write request to replica, replica write acknowledgement, read request to replica and replica read response. This is the so-called WARS model. Their results were obtained by Monte Carlo simulations, with which good approximations can be achieved. However, this approach cannot be used for evaluating whole telemedicine systems.
Simple simulation is not satisfactory because some parts of the state space can be missed due to randomization, so a modeling approach was chosen. Formal modeling is a widely used technique for verifying the correctness of systems [6]. Amazon and Microsoft have already used the Temporal Logic of Actions (TLA) and its TLA+ formal language [7] for creating specifications of their distributed systems [8,9]. During the model checking, they found several serious bugs that had not come up before. After studying their approach and system specifications, several telemedicine systems were modeled [10], and it was shown that an easily tunable system can assist the design of information critical heterogeneous systems.
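The Monte Carlo idea behind the WARS model can be illustrated with a minimal sketch. It is not the implementation of [5]: the exponential latency distribution, the function name and all parameter names are assumptions made for illustration only. For each trial, one (W, A, R, S) latency sample is drawn per replica, the write commits after the w-th fastest acknowledgement, and a read issued t time units later is counted as consistent if at least one of the r fastest responders had already received the write.

```python
import random


def pbs_consistency_probability(n, w, r, t, trials=10_000, mean_latency=1.0, seed=0):
    """Estimate the probability that a read issued t time units after a
    write commits returns that write (n replicas, write quorum w, read quorum r).
    Latencies are assumed to be exponentially distributed (illustrative choice)."""
    rng = random.Random(seed)
    consistent_reads = 0
    for _ in range(trials):
        # One WARS sample per replica: (W, A, R, S) one-way latencies.
        wars = [tuple(rng.expovariate(1.0 / mean_latency) for _ in range(4))
                for _ in range(n)]
        # The write commits once the w-th fastest acknowledgement arrives.
        commit_time = sorted(wl + al for wl, al, _, _ in wars)[w - 1]
        read_start = commit_time + t
        # The coordinator uses the r fastest read responses (R + S).
        fastest = sorted(wars, key=lambda s: s[2] + s[3])[:r]
        # A replica returns the new value if the write request reached it
        # no later than the read request did.
        if any(read_start + rl >= wl for wl, _, rl, _ in fastest):
            consistent_reads += 1
    return consistent_reads / trials
```

With overlapping quorums (r + w > n) the estimate is always 1.0, whereas partial quorums such as n = 3, w = 1, r = 1 yield a probability that grows with t, which is the t-visibility trade-off described above.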
Making the trade-off between availability and consistency has notable effects on the QoD. Data quality can be measured in different ways, and the appropriate measure greatly depends on the type of dataset and the context. In telemedicine systems, the most rapidly changing data portions are numeric data sets. QoD calculations are usually based on a distance function and an aggregation that describe inconsistencies between the real-world phenomena and the data obtained from resources. Hinrichs' formula, stated in Eq. (1), describes what QoD means in the context of telemedicine, where $x_{db}$ represents the data stored in a database and $x_{real}$ stands for the real-world data at a given time point $t$ [11]:

\[
\mathrm{QoD}(t) =
\begin{cases}
\dots \\
1, & \text{otherwise.}
\end{cases}
\tag{1}
\]

TELEMEDICINE USE-CASE
A former study [10] revealed the importance of availability and consistency in information critical heterogeneous systems. This paper presents a concrete, active telemedicine use-case maintained by Inclouded [12], through which the formal system modeling and a new graph-based evaluation technique are performed. The selected project concerns patients who have been diagnosed with metabolic syndrome. They have high blood pressure along with high fasting glucose levels and abdominal obesity, which can lead to cardiovascular disease [13]. Hence, different types of vital signs are measured, often simultaneously. All the measurements go through a similar data path that includes a sensor and a mobile client that collect the raw data and send them to the cloud. These devices are usually in the patient's home. In the cloud, there are a distributed database and computational units that are responsible for persistence and for performing resource-intensive tasks. There are also other computational tasks that depend on earlier aggregations, and these can make the data path more complicated. The result is available at different places: it is stored in the database, but it may also exist in the cache. The request for the final result is performed by a Web client that is controlled by a doctor or a nurse.
Here, the patient's 24-hour electrocardiography measurement is taken from the metabolic syndrome project. In this scenario, the raw data are sent to the database, the computational unit calculates the Q-wave, R-wave, S-wave (QRS) interval [14], and it sends the result back to the database.

Formal specification
Because an approximation does not examine the whole state space, the methodology chosen for system verification is system modeling with formal tools. In TLA+, a complete system specification can be written with its own syntax. In the system spec, the main processes are defined: the client operations Client Write (CW), Client Read (CR) and Client Read from Cache (CRC); DataBase Write (DB_W) for persistence; and DataBase PROCessing (DB_PROC) for aggregation. Both database persistence and aggregation are performed by distributed systems, so multiple server instances are initiated. In order to increase the availability of the system, the read operation of the client is separated into two parts, namely sending requests to the database and sending requests to the cache. Furthermore, to make the system easily tunable, the caches are configurable with the k-staleness parameter derived from the PBS method. Besides k-staleness, the latency is also taken into account. Code 1 shows the formal definition of the CW process. The original specification was written in PlusCal, but the TLA+ toolbox converts the spec into TLA+ syntax.

Simulation environment
The verification of the system spec can be performed by a model checker. The TLA+ toolbox has a built-in model checker, called TLA Checker (TLC). It constructs state graphs and evaluates them via graph traversals. The result is the diameter of the graph, the number of distinct states and the total number of states found.
In order to terminate the model checking, it is necessary to set up a threshold that limits the size of the state graph. Here, the threshold is given by the maximum allowed number of write operations, and it is set to 5. Also, the latency is restricted to the [0 ... 5] interval because the state space grows rapidly each time the interval is extended by 1. Four different latency types are taken into account: the latencies of client write, aggregation, client read from database and client read from cache. It is stated in [15] that there are huge differences among different computer actions. In the model checking only the Central Processing Unit (CPU) and Random Access Memory (RAM) are used, and the RAM access takes most of the time in the calculation of a new state in the graph. So, a new state of the graph can be generated within 100 ns. The significant amount of latency is caused by the network. It is also known that a network connection is almost 10,000,000 times slower than accessing the RAM [16], so increasing the latency by 1 means approximately 100 ms delay in our simulation environment. A delay between 0 and 500 ms can be valid for all the units in the system. The data used for simulation was obtained from the MIMIC-III Waveform Database [17][18][19][20] and transformed to integers in order to work with them in TLA+. The cache is configured with the k-staleness parameter, which takes values from 0 to 3. If k = 0, the client tries to obtain the most up-to-date data. The higher the k parameter, the more tolerance is added to the system for the staleness of the data.
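The role of the k-staleness parameter in a cache read can be illustrated with a small sketch. It is only one possible reading of the description above, not the TLA+ specification of the study: the version counters, the tuple layout and both function names are hypothetical. A cached value is accepted when it lags behind the latest write by at most k versions; otherwise the read falls back to the database.

```python
def cache_read_allowed(cached_version, latest_version, k):
    """Return True if the cached value is at most k versions behind the latest write."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return latest_version - cached_version <= k


def client_read(cache_entry, database_entry, k):
    """Serve a read from the cache when it is fresh enough, else from the database.

    Both entries are hypothetical (version, value) pairs.
    Returns a (source, value) pair.
    """
    cached_version, cached_value = cache_entry
    latest_version, latest_value = database_entry
    if cache_read_allowed(cached_version, latest_version, k):
        return ("cache", cached_value)
    return ("database", latest_value)
```

Under this reading, k = 0 accepts the cache only when it holds the latest version, while k = 3 tolerates a cache that is up to three writes behind.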

Evaluation of state graph
TLC produced 4 variants of state graphs because of the 4 given k-staleness parameter values. Each graph file is 12 GB and dumped in dot format. Dot is the basic file extension for the Graphviz [21] library, which is able to visualize and process graphs. The whole state space of the graphs consists of 335,409 nodes and 664,587 edges, which is too large to be understood in one piece. Moreover, this huge graph is hard to fit in memory, and it would produce an image file of a similar size. A tile system [22] could solve the visualization problem, but Graphviz cannot make this conversion at this point.
The original dot files contain long labels describing the current state of the system. If only one or two values are investigated, a significant number of bytes can be dropped and the size of the files can be considerably reduced. This study focuses on the QoD in the metabolic syndrome project, so only QoD values are kept. These values are the labels of the nodes that represent the possible states of the system. In order to make the graphs clearer, QoD values are converted to Red, Green, Blue (RGB) colors that are useful during visualization.
Hence, nodes have only an identifier and a fill color. In this way, a file-size reduction of 75% was achieved, and the original file size was cut to 3 GB. After reading the GML files, NetworkX found the weakly connected components in the graph. Each weakly connected component represents a simulation performed by TLC under given circumstances. Lastly, every weakly connected component can be dumped into a separate GML file to make further examination and visualization easier.
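The conversion of a QoD value into a fill color can be sketched as follows. The exact color scale used in the study is not given, so the grayscale mapping below (QoD 1.0 as white, QoD 0.0 as black) and the function and key names are assumptions for illustration.

```python
def qod_to_hex(qod):
    """Map a QoD value in [0, 1] to a grayscale RGB hex color (assumed scale)."""
    if not 0.0 <= qod <= 1.0:
        raise ValueError("QoD must lie in [0, 1]")
    level = round(qod * 255)
    return "#{0:02x}{0:02x}{0:02x}".format(level)


def slim_node(node_id, qod):
    """Keep only an identifier and a fill color for a state-graph node."""
    return {"id": node_id, "fillcolor": qod_to_hex(qod)}
```

Replacing each long state label with such a two-field record is the kind of reduction that shrinks the graph files considerably.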
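Splitting a directed state graph into weakly connected components can be sketched with the standard library alone. The study used NetworkX, whose `weakly_connected_components` function serves this purpose; the edge-list representation and the function name below are illustrative assumptions. Edge directions are ignored and each unvisited node starts a breadth-first search.

```python
from collections import defaultdict, deque


def weakly_connected_components(nodes, edges):
    """Return the weakly connected components of a directed graph as node sets."""
    # Ignore edge direction by storing every edge both ways.
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    return components
```

Each returned node set can then be written to its own graph file, which keeps the individual files small enough for ordinary tools.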

Evaluation of state graphs
TLC produced state spaces with more than 280,000 components. After checking the content of the variables in each state, it was found that the evaluation order of the time windows in QRS interval calculations is non-deterministic due to the multiple computational instances. This issue was identified and fixed in the system.
The graph components produced by TLC can have different sizes and various shapes depending on the number of states that can be reached from the initial ones. At first glance, these graphs appear to be acyclic, but a quick graph analysis shows that the original graph files contain cycles. These cycles are caused by self-loop edges at the leaves of the graphs. TLC adds edges to the graph depending on the next executed TLA+ process. The self-loop edges at the leaves are added to indicate termination. Since TLC did not find any deadlocks or errors, every execution of the processes terminated and worked properly.
After removing the terminating self-loop edges, it was found that these graphs have a Directed Acyclic Graph (DAG) [24] structure. Since a DAG is also suitable for describing data process networks [25], it is an appropriate structure for characterizing distributed telemedicine systems as well. If the entire system is modeled, all the weakly connected components will have a DAG structure because the whole history of the system is kept, and it is impossible to have an edge to a node that has already been visited. The density of these DAGs is less than 0.3, so they are considered sparse graphs.
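The check described above can be sketched in a few lines: drop the terminating self-loops, test acyclicity with Kahn's algorithm, and compute the density. The edge-list representation and function names are illustrative, and the density definition for directed graphs, m / (n(n - 1)), is an assumption, since the text does not state which definition was used.

```python
from collections import defaultdict, deque


def drop_self_loops(edges):
    """Remove edges that point from a node to itself."""
    return [(src, dst) for src, dst in edges if src != dst]


def is_dag(nodes, edges):
    """Kahn's algorithm: a graph is acyclic iff every node can be removed
    in an order where its in-degree has dropped to zero."""
    indegree = {node: 0 for node in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(node for node, deg in indegree.items() if deg == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return removed == len(indegree)


def density(node_count, edge_count):
    """Density of a directed graph without self-loops: m / (n * (n - 1))."""
    if node_count < 2:
        return 0.0
    return edge_count / (node_count * (node_count - 1))
```

For a four-node component with edges 0→1, 0→2, 1→3 and a self-loop on the leaf 3, the check fails before the self-loop is dropped and succeeds afterwards, with a density of 0.25.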

Critical paths in information critical heterogeneous systems
The methodology and the elaborated system model show whether there are executions that lead to drastic reduction in QoD.This information can help limit the caches and latencies in order to get the required level of QoD.
DAGs carry a lot of information. If DAGs are applied to distributed telemedicine systems, topological ordering and the longest paths can reveal those paths and nodes that lead to critical operations [26]. Topological ordering returns an order of events in which the system works properly or in which the system does not work as expected.
Finding the longest paths in arbitrary graphs is a Nondeterministic Polynomial-time hard (NP-hard) problem, but it can be carried out in linear time if the graph is a DAG [27]. The longest path in a DAG can be used to find a critical path that can lead to inconsistency in the system, or along which the system can go into unexpected states that may result in a lower QoD.
In Fig. 2, the X axis stands for the length of the possible critical paths by which the DAGs are grouped. QoD measurements were grouped by the longest paths found in the DAGs. In this simulation environment, the longest possible path is 24 in the biggest components. However, there are some components whose longest path has a length of only 2 steps. Based on the topological ordering and longest path algorithms, an examination of the components shows that the data quality starts to level out after the point where the latencies start to have similar values (when the longest path is above 13). Since the computational unit works as a trigger, with no further restrictions, consistent data can only be guaranteed if data arrival is slower than processing. Figure 2 shows this phenomenon in peaks. If the delay of persistence is increased while other processes are left unchanged, a rapid
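The linear-time longest path computation on a DAG can be sketched as dynamic programming over a topological order. The version below is a minimal standard-library sketch with illustrative names, measuring path length in edges; NetworkX offers `dag_longest_path_length` for the same purpose.

```python
from collections import defaultdict, deque


def longest_path_length(nodes, edges):
    """Length (in edges) of the longest path in a DAG, in O(|V| + |E|) time.

    Raises ValueError if the graph contains a cycle.
    """
    indegree = {node: 0 for node in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # longest[v] = length of the longest path that ends in v.
    longest = {node: 0 for node in indegree}
    queue = deque(node for node, deg in indegree.items() if deg == 0)
    processed = 0
    while queue:
        # A node is dequeued only after all of its predecessors,
        # so longest[node] is final at this point.
        node = queue.popleft()
        processed += 1
        for nxt in successors[node]:
            longest[nxt] = max(longest[nxt], longest[node] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if processed != len(indegree):
        raise ValueError("graph contains a cycle")
    return max(longest.values(), default=0)
```

Every node and edge is visited once, which is why the computation remains feasible even for components with many thousands of states.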

QoD-based graph clustering
There are well-known and well-tried clustering techniques for finding similar objects, and they can be applied to graph components as well [28]. Even after seeing how the QoD changes as the longest path in the DAG increases, the graph components still cannot be clustered on that basis because of their cardinality. Several graph editor tools contain clustering methods and algorithms, but none of them found clear similarities among the components. It was found that if components had the same QoD values in their leaves, they had the same structure. Therefore, after grouping the components of the whole state space by the QoD values in their leaves, about 4,000 clusters were created. Some clusters contain 2 or 3 components, but others have thousands of graphs.
Figure 3 shows 3 different graphs that were obtained by this clustering method. These were created via the Graphia [29] visualization tool. Black nodes are root nodes of components that represent the 5 defined TLA+ processes. The white color means that the QoD value is 100% in the given state, but in the gray leaves, the QoD is reduced to 25%. The number in the upper-left corner stands for the identifier of the component in the state space. Table 1 lists the whole clustering results in numerical terms. It can be seen that many different executions of the system result in the same graph.
With this technique, not only can similar graph components be grouped, but the separate components can also be dumped into graph files, and visualization can be performed using low-performance applications and computers. All in all, this clustering technique is the most helpful in reducing complexity in enormous graph spaces.
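Grouping components by the QoD values in their leaves can be sketched as follows. The data layout (a mapping from node to QoD plus an edge list per component) and the choice of the sorted multiset of leaf QoD values as the cluster key are assumptions for illustration; the study does not state how the leaf values were compared.

```python
from collections import defaultdict


def leaf_signature(qod_by_node, edges):
    """Sorted tuple of the QoD values found in the leaves of one component.

    A leaf is a node without outgoing edges; terminating self-loops are ignored.
    """
    has_successor = {src for src, dst in edges if src != dst}
    return tuple(sorted(qod for node, qod in qod_by_node.items()
                        if node not in has_successor))


def cluster_components(components):
    """Group component identifiers by their leaf QoD signature."""
    clusters = defaultdict(list)
    for component_id, (qod_by_node, edges) in components.items():
        clusters[leaf_signature(qod_by_node, edges)].append(component_id)
    return dict(clusters)
```

Because only the leaves are inspected, the grouping is cheap to compute even when a state space contains hundreds of thousands of components.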

CONCLUSIONS
In this study, it was found that formal modeling and model checking can achieve a complete simulation of a real-world telemedicine system. At this level of abstraction, the methodology pointed out phenomena that cannot be observed merely on the basis of knowing the system. The created state space can be described by a giant graph that has thousands of weakly connected components. Due to the size of the state space, the complexity of the graph must be reduced in order to obtain valuable graph analytical results. Grouping the components of this system graph by data quality measurements seems to be an appropriate clustering technique that makes visualization and analysis easier and clearer. All the weakly connected components in a telemedicine system graph have a DAG structure, and this composition makes many graph algorithms feasible in linear time. Based on the length of the longest paths in the DAGs, critical paths can be found, and it was also shown in which component sizes the highest QoD is most likely to occur. In the future, it is planned to extend these graph analytical techniques and to examine other telemedicine use-cases as well.

Fig. 1. Expected structure of telemedicine systems versus data paths in real telehealth systems

Fig. 2. The changes in QoD depending on the longest path in DAG

Table 1. Number of weakly connected components and generated clusters with different k-staleness parameter values